Linear Model Selection and Regularization


Why Consider Alternatives to Least Squares?

  • Prediction Accuracy:

    • especially when \(p > n\), to control the variance
  • Model Interpretability:

    • By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted

    • We will present some approaches for automatically performing feature selection


Three Classes of Methods


Subset Selection

  • Best Subset Selection Procedures

    • Let \(M_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation

    • For \(k = 1, 2, ..., p\):

      • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors

      • Pick the best among these \(\binom{p}{k}\) models and call it \(M_k\)

        • Where best is defined as having the smallest RSS or equivalently largest \(R^2\)
    • Select a single best model from among \(M_0,...,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\) (a small code sketch of the enumeration step follows the Credit example below)

  • Example: Credit Data Set


  • For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and \(R^2\) are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and \(R^2\). Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables
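
As a rough illustration of the procedure above, here is a minimal NumPy sketch of the enumeration step. The function name `best_subset_rss` and the arrays `X` and `y` are placeholders for a generic predictor matrix and response; selecting among the resulting \(M_k\) would be done separately, using cross-validation or one of the criteria discussed later.

```python
import numpy as np
from itertools import combinations

def best_subset_rss(X, y):
    """For each size k, return the predictor subset with the smallest RSS.

    Returns a dict {k: (tuple of column indices, RSS)}. Note this fits all
    2^p models, so it is only feasible for small p.
    """
    n, p = X.shape
    # M_0: the null model predicts the sample mean for every observation
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            # Least squares fit on an intercept plus the chosen columns
            Xk = np.column_stack([np.ones(n), X[:, list(subset)]])
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            rss = float(np.sum((y - Xk @ beta) ** 2))
            if k not in best or rss < best[k][1]:
                best[k] = (subset, rss)
    return best
```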

Extensions to Other Models

  • Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression

  • The deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models
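
For instance, for logistic regression the deviance can be computed directly from predicted class probabilities. A minimal sketch, assuming `model` is a classifier already fit to arrays `X` and `y` without a penalty (so that its log-likelihood really is maximized); the names are placeholders.

```python
from sklearn.metrics import log_loss

def deviance(model, X, y):
    """Deviance = -2 * maximized log-likelihood of a fitted classifier.

    log_loss returns the average negative log-likelihood over the n
    observations, so multiplying by n and by 2 gives the deviance.
    """
    return 2.0 * len(y) * log_loss(y, model.predict_proba(X))
```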


Stepwise Selection


Forward Stepwise Selection

  • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model

  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model

  • The computational advantage over best subset selection is clear: forward stepwise fits only \(1 + p(p+1)/2\) models, rather than all \(2^p\)

  • It is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors

  • Forward Stepwise Selection

    • Let \(M_0\) denote the null model, which contains no predictors

    • For \(k = 0,...,p-1\):

      • Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor

      • Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\)

    • Select a single best model from among \(M_0,...,M_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\) (a sketch of the greedy search follows the Credit example below)

  • Example: Credit Data


  • The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical, but the fourth models differ
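
A minimal sketch of the greedy search, under the same assumptions as the best subset sketch above (generic NumPy arrays `X` and `y`); at each step it adds the single predictor that most reduces the RSS.

```python
import numpy as np

def forward_stepwise_rss(X, y):
    """Greedy forward selection: return the sequence of models M_0, ..., M_p.

    Each entry is (list of selected column indices, RSS). Choosing among the
    M_k (by cross-validation, Cp, AIC, BIC, or adjusted R^2) is a separate step.
    """
    n, p = X.shape

    def rss(cols):
        # Least squares fit on an intercept plus the chosen columns
        Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        return float(np.sum((y - Xk @ beta) ** 2))

    selected = []
    path = [([], float(np.sum((y - y.mean()) ** 2)))]  # M_0: the null model
    for _ in range(p):
        remaining = [j for j in range(p) if j not in selected]
        # Add the single predictor whose inclusion gives the smallest RSS
        best_j = min(remaining, key=lambda j: rss(selected + [j]))
        selected = selected + [best_j]
        path.append((list(selected), rss(selected)))
    return path
```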

Backward Stepwise Selection


Choosing the Optimal Model


Estimating Test Error: Two Approaches


\(C_p\), AIC, BIC, and Adjusted \(R^2\)

  • These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables

  • The figure below displays \(C_p\), BIC, and adjusted \(R^2\) for the best model of each size produced by the best subset selection on the Credit data set


  • Mallows's \(C_p\):

    • \(C_p = \frac{1}{n}(RSS + 2d \hat{\sigma}^2)\)

      • Where \(d\) is the total number of parameters used and \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement
  • AIC criterion is defined for a large class of models fit by maximum likelihood:

    • \(AIC = -2 \log L + 2 \cdot d\)

      • Where \(L\) is the maximized value of the likelihood function for the estimated model
  • In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and \(C_p\) and AIC are equivalent

  • BIC:

    • \(BIC = \frac{1}{n}(RSS + \log(n) d \hat{\sigma}^2)\)

      • Like \(C_p\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value

      • Notice that BIC replaces the \(2d\hat{\sigma}^2\) used by \(C_p\) with a \(\log(n) d \hat{\sigma}^2\) term, where \(n\) is the number of observations

      • Since \(\log n > 2\) for any \(n > 7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\)

  • Adjusted \(R^2\)

    • For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as:

      • Adjusted \(R^2\) = \(1 - \frac{RSS / (n-d-1)}{TSS / (n-1)}\)

        • Where TSS is the total sum of squares
      • Unlike \(C_p\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error

      • Maximizing the adjusted \(R^2\) is equivalent to minimizing \(\frac{RSS}{n-d-1}\). While RSS always decreases as the number of variables in the model increases, \(\frac{RSS}{n-d-1}\) may increase or decrease, due to the presence of \(d\) in the denominator

      • Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model
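
A small sketch that evaluates these criteria directly from the formulas above; `rss`, `tss`, `n`, `d`, and `sigma2_hat` are assumed to be already available from a least squares fit, with `sigma2_hat` typically estimated from the full model containing all predictors.

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2_hat):
    """Model-size-adjusted criteria for a least squares fit with d variables."""
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "BIC": bic, "adjusted_R2": adj_r2}
```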


Validation and Cross-Validation


Shrinkage Methods


Ridge Regression

  • Recall that the least squares fitting procedure estimates \(\beta_0, \beta_1,..., \beta_p\) using the values that minimize \(RSS = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2\)

  • In contrast, the ridge regression coefficient estimates \(\hat{\beta}^R\) are the values that minimize \(\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2\)

    • Where \(\lambda \geq 0\) is a tuning parameter to be determined separately
  • As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small

  • However, the second term, \(\lambda \sum_{j=1}^{p} \beta_j^2\), called a shrinkage penalty, is small when \(\beta_1,...,\beta_p\) are close to zero, and so it has the effect of shrinking the estimates of \(\beta_j\) towards zero

  • The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates

  • Selecting a good value for \(\lambda\) is critical; cross-validation is used for this
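
As a toy illustration of the objective above, here is a closed-form NumPy sketch for a given \(\lambda\); `X`, `y`, and the function name are placeholders, and in practice \(\lambda\) would be chosen by cross-validation (for example with scikit-learn's RidgeCV) rather than supplied by hand.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge estimates minimizing RSS + lam * sum(beta_j^2).

    The intercept is not penalized: it is recovered after centering y and
    the columns of X.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```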


Ridge Regression: Scaling of Predictors

  • The standard least squares coefficient estimates are scale equivariant: multiplying \(X_j\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\). In other words, regardless of how the \(j\)th predictor is scaled, \(X_j \hat{\beta}_j\) will remain the same

  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function

  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula:

    • \(\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}}\)
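
A one-line NumPy version of this standardization, assuming `X` is a generic predictor matrix; note the \(1/n\) (population) standard deviation in the formula, which matches NumPy's default `ddof=0`.

```python
import numpy as np

def standardize_columns(X):
    """Divide each column by its (1/n) standard deviation, as in the formula above."""
    return X / X.std(axis=0, ddof=0)
```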

Why Does Ridge Regression Improve Over Least Squares?

  • The Bias-Variance tradeoff


  • Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\) and \(\|\hat{\beta}_\lambda^R\|_2/\|\hat{\beta}\|_2\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest

The Lasso


The Lasso: Continued

  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero

  • However, in the case of the lasso, the \(\ell_1\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large

  • Hence, much like best subset selection, the lasso performs variable selection

  • We say that the lasso yields sparse models — that is, models that involve only a subset of the variables

  • As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical; cross-validation is again the method of choice
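
A minimal scikit-learn sketch of this sparsity; the simulated data below is invented purely for illustration, and `LassoCV` chooses \(\lambda\) (called `alpha` in scikit-learn) by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Toy data: 45 predictors, but only the first 5 actually affect the response
rng = np.random.default_rng(0)
n, p = 50, 45
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=n)

# Standardize the predictors, then let cross-validation pick lambda;
# many of the resulting coefficient estimates are exactly zero
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=10).fit(X_std, y)
print("lambda chosen by CV:", lasso.alpha_)
print("variables kept by the lasso:", np.flatnonzero(lasso.coef_))
```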



The Variable Selection Property of the Lasso

  • Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

  • One can show that the lasso and ridge regression coefficient estimates solve the problems

    • \(\underset{\beta}{\text{minimize}} \sum_{i=1}^{n}\left ( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2\)

      • Subject to \(\sum_{j=1}^{p}|\beta_j| \leq s\)
    • And

    • \(\underset{\beta}{\text{minimize}} \sum_{i=1}^{n}\left ( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2\)

      • Subject to \(\sum_{j=1}^{p}\beta_j^2 \leq s\)
    • Respectively

  • The Lasso


  • Comparing the Lasso and Ridge Regression


  • Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on simulated data. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their \(R^2\) on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest

Conclusions


Selecting the Tuning Parameter for Ridge Regression and Lasso


Dimension Reduction Methods


Principal Components Regression


Partial Least Squares


Summary